1.0 INTRODUCTION


Reliability  is  defined  as  the  "probability  that  a 
component  part,  equipment,  or  system  will  satisfactorily 
perform  its  intended  function  under  given  circumstances, 
such  as  environmental  conditions,  limitations  as  to 
operating  time,  and  frequency  and  thoroughness  of 
maintenance  for  a  specific  period  of  time."* 

This probability continues to be a major concern in
communications-oriented systems. Because data communications
equipment will fail, users should know how failed
equipment will affect network operations. A network
failure may take only one or more terminals out of service;
however, network performance may also be degraded to the
point of a complete system shutdown. Several questions emerge
when considering reliability:

a.  How to estimate system availability, in terms of
total number of hours per month that terminals can
communicate, in order to judge the overall performance
and adequacy of a network's design, product
construction, and maintenance?

*  From McGraw-Hill, Dictionary of Scientific and Technical
Terms, c 197R, pg. 1349.
b.  What are the chances that a terminal, when needed,
will not be available to run a job?

c.  What is the probability that a project of lengthy
run, such as a remote-job-entry task, can be
completed without being interrupted by a failure at
any point in the terminal-to-computer link?

d.  Will adding redundant equipment in strategic places
change reliability and improve system availability?
If so, does the improvement justify the cost of
redundancy?

The search for high reliability must consider the
following:

a.  Satisfactory network performance;

b.  Minimum capital investment;

c.  Improved network configuration;

d.  Selection of reliable equipment;

e.  Reduced maintenance cost.


Within the context of systems in general, including data
networks, reliability is defined as the probability that
no failure will occur within a given time period.
Conversely, unreliability is the probability of failure
within a given time period. One measure of the reliability
of a system is the mean time between failures, or MTBF. The
answer to the question on terminal availability requires
the introduction of the concept of mean time to repair
(MTTR) or, more specifically, an average MTTR embracing
all devices in the link. When a device fails, some
time will pass before it can be repaired and restored to
service. The longer the MTTR, the lower the availability
of the terminal to the user. The MTTR is obtained
from operating experience, and each device in a series
link will have its own MTTR value.

These views of reliability rest on the following
assumptions:

a.  Equipment "burn-in" and software debugging have been
completed before the operating time period (MTBF)
begins.

b.  The operating time period of interest never extends
beyond the useful life of the equipment or system.

c.  Equipment failures occur at random.

d.  The number of system failures in a given time
period is the same for all equally long periods.

e.  The equipment operates in a reasonable, specified
environment in a specified manner.

The MTTR and MTBF parameters are available from most
central equipment suppliers on a unit-by-unit basis.
Other essential parts of the system, such as the Data
Transmission Media (DTM), can vary greatly and will be
site dependent.

The system availability can be considered the probability
of the system working at any given time. It is the
percentage of time that the system is actually working. This
percentage is obtained from the MTBF and MTTR.
Availability can be calculated for one component, a part
of the system (two or more components), or the entire
system. However, the respective MTBF and MTTR must be
known.
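The relationship between availability, MTBF, and MTTR described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard steady-state ratio MTBF / (MTBF + MTTR); the function names, the 720-hour month, and the sample figures (1,000-hour MTBF, 4-hour MTTR) are hypothetical and are not taken from this report.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the unit is working."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def usable_hours(avail: float, period_hours: float = 720.0) -> float:
    """Expected working hours in a period (720 h is an assumed 30-day month)."""
    return avail * period_hours


# Hypothetical device: fails on average every 1,000 h, takes 4 h to repair.
a = availability(1000.0, 4.0)
print(f"availability         = {a:.4f}")
print(f"usable hours / month = {usable_hours(a):.1f}")
```

The same function applies to a single component or to a whole link, provided the MTBF and MTTR passed in describe that same unit.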


2.0  THE  DESIGN  CONSTANTS 


The constants of the application must be gauged before
considering any of the numerous possible network
arrangements. This procedure is necessary for obtaining
estimates of the following business application
parameters:

a.  Number  and  locations  of  the  processing  sites; 

b.  Number  and  locations  of  the  remote  terminals; 

c.  Information  flow  patterns  between  the  terminals  and 
processing  sites; 

d.  Types of transactions to be processed;

e.  Traffic  volumes  for  the  transaction  types,  which 
may  depend  on  the  type  of  network  configuration 
employed; 

f.  Urgency of the information to be transmitted (when
must the response be supplied to the remote station,
or how soon is the data file required at the
destination?);

g.  Capacity reserved for traffic growth;

h.  Acceptable undetected information error rates (bit
or block);

i.  Available  financial  resources; 

j.  Reliability  and  availability  requirements. 

The geography and performance requirements of the
network must be defined by these factors before any
major equipment decisions are made. Geographical
separation of sources and sinks, and the urgency
requirements of the messages, are the primary basis for data
communication network usage. A source and a sink are
generalized terminal devices, programs, or data files
which serve as points of origination and destination,
respectively.

2.1 NETWORK TOPOLOGY

In communications trades, a network is defined as a
number of radio or television broadcast stations
connected by coaxial cable, radio, or wire lines, so
that all stations can broadcast the same programs
simultaneously. Local area networks (LANs) are usually
described as privately owned networks that offer reliable,
high-speed communication channels, optimized for
connecting information processing equipment in a limited
geographic area, such as an office, building, complex
of buildings, or campus.

LANs are unique, because they can be designed with many
technological variations arranged in many different
configurations.  Local  area  networks  are  best  defined  in 
terms  of  the  services  they  provide,  and  applications 
they  make  possible. 

A network topology is created by the geometric arrangement
of the links and nodes that make up a network. A
link (which is also termed a line, channel, or circuit)
is a communications path between two nodes. A node is
generally defined as a terminal in an electrical network
that is common to more than two elements or parts of
elements of the network. The hardware and software
chosen for a particular node are determined by the
function of that node in the network.

Physical  and  Logical  Links 

A combination of physical and logical connections is the
basis for node communication. Electro-mechanical
circuits between nodes (permanent or temporary) constitute
the physical connections. A logical connection
implies that two nodes can communicate, regardless of
whether a direct physical connection exists. Figure 1
displays physical connections between A-B, A-D, C-D, C-B,
E-C, and E-D. E cannot communicate directly with A or B.
Node A could communicate with E, however, if the
intervening nodes have the ability for routing, that is,
passing a message along to an adjacent node.
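The distinction between a physical link and a logical connection can be illustrated with a small sketch. It uses the six physical connections listed for Figure 1 and assumes every node is willing to route; the function names and the choice of breadth-first search are illustrative, not something the report prescribes.

```python
from collections import deque

# Physical links listed for Figure 1 (treated as two-way circuits).
LINKS = [("A", "B"), ("A", "D"), ("C", "D"), ("C", "B"), ("E", "C"), ("E", "D")]


def build_adjacency(links):
    """Map each node to the list of nodes it is physically linked to."""
    adjacency = {}
    for u, v in links:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


def has_physical_link(adjacency, src, dst):
    return dst in adjacency.get(src, [])


def find_route(adjacency, src, dst):
    """Breadth-first search; returns a shortest node path, or None."""
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


ADJ = build_adjacency(LINKS)
print(has_physical_link(ADJ, "A", "E"))  # False: no direct circuit
print(find_route(ADJ, "A", "E"))         # ['A', 'D', 'E']: logical connection via D
```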

Point-To-Point  and  Multipoint  Links 

Two  types  of  links  exist  as  the  building  blocks  of 
network  topologies. 

A circuit that connects two, and only two, nodes without
passing through an intermediate node represents a
point-to-point link. Figure 1 and Figure 2 display
various configurations of these point-to-point lines.
Figure 3 displays an example of a network which is
becoming increasingly complex and expensive as full
connection is sought.

Figure 4 shows an example of how point-to-point links
can be simplified; however, routing decisions must now
be made when messages are passed on to other nodes, as
active component (such as a repeater), adding a new
node, or any other break in the ring configuration will
cause the network to stop functioning in most cases.
Steps can be taken to allow bypass of failure points in
distributed rings, although this usually increases the
complexity of the repeater at each node, as well as the
component cost. Failure of the control node in a
centrally controlled ring would inevitably lead to
network failure as well.

2.4 BUS NETWORKS


The bus topology functions similarly to a multipoint
line; in other words, a single line which is shared by
a number of nodes. Refer to Figure 11, Bus Network.

As previously discussed, a star network has all nodes
fully connected with point-to-point links, which are
joined at a single point or central switch. A ring
network consists of separate point-to-point links that
are fully connected in a circular arrangement. In
contrast to these types of topologies, bus nodes share
a single physical channel, by way of cable taps or connectors.

The bus network thus creates a fully connected shared
FIGURE 10A - RING NETWORK


CONTROL NODE


FIGURE 10B - RING (LOOP) NETWORK
Ring networks with centralized control are often
referred to as "loops" (see Figure 10B for a common
configuration). One of the nodes (the control node)
attached to the network controls the access to and
communication over the channel by the other nodes. Once
permission is given by the control node to transmit a
message, the communication can travel around the ring to
its destination without further intervention by the
control node.

Another loop design has all message exchanges pass through
the central node. In this scheme, the loop network
resembles the star network.

The following are considerations for ring network
configurations:

a.  Rings must be physically arranged so that they are
fully connected.

b.  Lines have to be placed between any new node and
its two adjacent nodes each time an addition is
made.

The major disadvantage of a loop is reliability. It is
vulnerable to failure of the interfaces because of
ring. Two basic conditions must be met for each node in
this type of configuration:


a.  Each node must be able to detect its own address.

b.  Each node must be able to serve as its own active
repeater, retransmitting messages addressed to
other nodes.

Retransmission  of  messages  can  make  ring  nodes  more  or 
less  complex,  dependent  upon  their  application,  since 
messages  automatically  travel  to  the  next  node  on  the 
ring. 

When ring configurations are used to distribute and
control local networks, access and allocation methods
must be utilized to avoid opposing demands for the shared
channel. One method involves circulating a bit pattern,
termed a token, around the ring. When a node seizes the
token, it gains exclusive access to the channel. When
the node is finished transmitting, it passes the right
to access (i.e., the token) on to the other nodes.
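The token-passing method just described can be sketched as a short simulation. This is a simplified model, not the report's own algorithm: it assumes the token starts at node 0, moves to the next node in ring order, and lets the holder send at most one queued message per visit. The function name and the message labels are hypothetical.

```python
def circulate_token(pending, rounds=1):
    """Pass the token around the ring; the holder sends one queued message.

    pending: list of lists; pending[i] holds messages waiting at node i.
    Returns the transmissions in order as (node, message) pairs.
    """
    sent = []
    n = len(pending)
    holder = 0  # assumed starting position of the token
    for _ in range(rounds * n):
        if pending[holder]:
            # Only the token holder may use the shared channel.
            sent.append((holder, pending[holder].pop(0)))
        holder = (holder + 1) % n  # pass the token to the next node
    return sent


queues = [["a1", "a2"], [], ["c1"]]
print(circulate_token(queues, rounds=2))
# [(0, 'a1'), (2, 'c1'), (0, 'a2')]
```

Because only one node holds the token at a time, two nodes never contend for the channel, which is the property the access method is meant to guarantee.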

Rings thus provide a common network channel, with all
nodes being logically connected. Under distributed
control, each node, under its own initiative, can
communicate directly with all of the nodes.
sending node to its requested destination node. The
switch may give a "circuits busy" signal to the node
requesting to send. When the available port(s) of the
destination node are being utilized, the switch may
issue a "station busy" signal to the sending node.
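The two busy signals can be illustrated with a toy model of the central switch. The class name, the fixed pool of circuits, and the per-node port limit are assumptions made for this sketch only; an actual switch would also release circuits and ports when a call ends.

```python
class StarSwitch:
    """Toy central switch; circuit and port limits are illustrative only."""

    def __init__(self, circuits, ports_per_node):
        self.free_circuits = circuits
        self.ports_per_node = ports_per_node
        self.ports_in_use = {}  # destination node -> ports currently occupied

    def request(self, src, dst):
        """Return the switch's answer to src asking for a circuit to dst."""
        if self.free_circuits == 0:
            return "circuits busy"
        if self.ports_in_use.get(dst, 0) >= self.ports_per_node:
            return "station busy"
        self.free_circuits -= 1
        self.ports_in_use[dst] = self.ports_in_use.get(dst, 0) + 1
        return "connected"


switch = StarSwitch(circuits=2, ports_per_node=1)
print(switch.request("A", "B"))  # connected
print(switch.request("C", "B"))  # station busy: B's only port is in use
print(switch.request("C", "D"))  # connected
print(switch.request("A", "D"))  # circuits busy: both circuits are taken
```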

There is a major drawback to the star network
configuration. The central node is the focal
point of this network, and the entire system will go down
if and when the central node goes down. Thus, the need
for redundancy and reliability at the central node is
self-evident.

Time-sharing applications, with the central node serving
as time-sharing host, constitute a major, widely used form
of the star network. Quite common is the PBX (private
branch exchange) telephone network. The star network
also manifests itself in small, clustered networks, like
word processing groups.

2.3 RING ARCHITECTURE

Ring topologies are typified by point-to-point links
and continuous, unbroken circular configurations.
Transmitted messages travel from node to node around the
b.  Between the outlying nodes;

c.  And from all nodes to remote points.

Consequently, outlying nodes are relieved of control
functions.

Central and outlying nodes are thus connected with
point-to-point lines. This type of configuration forms
a low-cost, simpler connection to the central node.
This type of star network configuration is ideal for
communication between the central and outlying nodes.

However, if traffic is heavy between the outlying nodes,
the central node may be unduly taxed.

The star network can be constructed in another form,
utilizing the outlying nodes as the controlling nodes.

In this scheme, one outlying node may exercise all
control, or control may be spread generally between all
nodes. Regardless, the central node's control is
minimized, and it serves as a switch to establish circuits
between the outlying nodes.

With  distributed  control,  a  method  is  necessary  for 
solving  conflicting  requests  for  connections  between 
nodes.  Circuits  may  not  be  available  to  connect  the 


FIGURE 9 - STAR NETWORK


a.  Delays in communications, due to nodes which make
routing decisions and so perform more network-related
functions than is desirable for LAN nodes.
These delays thus result in more overhead.

b.  The efficiencies and savings gained by making only
the necessary node connections are not as applicable
to local area networks.

The  following  topologies  allow  more  effective,  uniform 
implementation  of  network  control  strategies.  They  are 
ring,  bus,  and  star  networks. 

STAR  NETWORKS 

The distinguishing characteristic of this type of
topology is that all nodes are joined at a single point,
as illustrated in Figure 9. Control is maintained in
basically one of three ways. Frequently, star
configurations are utilized for networks in which network
control is located in a central node or switch. All
routing of network message traffic is done by the
central node:

a.  From the central to outlying nodes;
each node can utilize the line for transmission when the
line is free of "message traffic". This type of
mechanism is organized with a set of rules implemented in
each node. Refer to Figure 7.

Unconstrained  Topologies 

Unconstrained topologies are also termed hybrid or mesh
network configurations. These configurations are of a
general nature, and the actual connections made will
determine the configuration's shape. The variations from
one implementation to another can be significant.

Network economics usually determine the connections.
Efficiency is best achieved by selecting only the
necessary connections.

Combinations of point-to-point and multipoint links
utilizing routing and non-routing nodes are good
examples of unconstrained topologies. These types are
well suited to, and commonly used for, long-haul,
packet-switched networks. Undesirable characteristics do
exist; they include:


MASTER


TRIBUTARIES


FIGURE 6 - CENTRALIZED CONTROL
FIGURE 7 - DISTRIBUTED CONTROL


FIGURE  8  -  UNCONSTRAINED  TOPOLOGY 


An example of this type of node is a dedicated
communications processor or switch.

With distributed control, nodes:

a.  Establish connections; or

b.  Access the network channel independently.

In  some  instances,  all  nodes  have  an  equal  chance  to 
utilize  the  network  to  communicate. 

Figure 6 displays an example of centralized control. A
master node centralizes control of the multipoint line.
The other nodes which utilize the line are termed
tributaries, and they are controlled by the master node.
All messages to, from, and between these nodes must pass
through the master. Each node is queried in the order
specified by an internal list to determine which will
transmit. Which node receives is determined by
selecting each node through its address.
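The polling and selection procedure can be sketched as one pass of the master over its internal list. The function name, the tributary names, and the rule that a polled tributary empties its whole queue before the master moves on are assumptions of this sketch; the report does not specify how many messages a tributary may send per poll.

```python
def poll_cycle(polling_list, outbound):
    """One polling pass by the master node of a multipoint line.

    polling_list: tributary addresses in the order they are queried.
    outbound: dict mapping tributary -> list of (destination, text) waiting.
    Returns deliveries as (source, destination, text); every message is
    relayed through the master.
    """
    deliveries = []
    for tributary in polling_list:
        queue = outbound.get(tributary, [])
        while queue:
            destination, text = queue.pop(0)
            # The master selects the receiving node by its address.
            deliveries.append((tributary, destination, text))
    return deliveries


waiting = {"T3": [("T1", "status")], "T1": [("T2", "report")]}
print(poll_cycle(["T1", "T2", "T3"], waiting))
# [('T1', 'T2', 'report'), ('T3', 'T1', 'status')]
```

The order of delivery follows the polling list, not the order in which messages were queued, which is the practical consequence of centralized control.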


well as recognizing and accepting messages addressed to
themselves. The latter function is retained in this routing
process. In Figure 4, intervening nodes must now be employed.
Figure 5 displays a multipoint or multidrop link. Basically,
this type of link is a single line which is shared by more
than two nodes. Multipoint lines minimize the number of
lines required for node connection, and so the line cost. Thus,
the six nodes pictured in Figure 5 can communicate by sharing a
single multipoint line. Nodes on this type of line are
generally more complex than simple point-to-point nodes.

They must handle messages based on their addresses, similar
to routing nodes. Access to the line must be controlled by
some method to avoid usage conflicts, since this line is
shared by a number of nodes.
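The saving offered by a multipoint line can be put in numbers. Connecting every pair of n nodes with its own point-to-point line requires n(n-1)/2 lines, whereas a multipoint arrangement needs a single shared line. The helper name below is hypothetical; the six-node case matches the number of nodes described for Figure 5.

```python
def full_mesh_links(nodes: int) -> int:
    """Point-to-point lines needed to connect every pair of nodes directly."""
    return nodes * (nodes - 1) // 2


for n in (3, 6, 12):
    print(f"{n:2d} nodes: {full_mesh_links(n):2d} point-to-point lines vs 1 multipoint line")
```

For six nodes the full mesh needs 15 lines, and the count grows roughly with the square of the number of nodes, which is why full connection becomes expensive so quickly.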

Topology  and  Channel  Control 

Network  designers  must  decide  if  control  of  a  network  is 
to  be  centralized  or  distributed.  With  centralized 
control,  one  node  controls: 

a.  Access  to  the  network;  or,  which  nodes  can  send 
messages,  and  when. 

b.  Allocation of the channel (how much of a channel
can a node use, and for how long?).
FIGURE 2 - POINT-TO-POINT, VARIOUS DISPLAYS


FIGURE 3 - POINT-TO-POINT, COMPLEX


FIGURE 4 - POINT-TO-POINT, ROUTING


FIGURE 5 - MULTIPOINT OR MULTIDROP LINK


FIGURE 1 - PHYSICAL AND LOGICAL LINKS
FIGURE 11 - BUS NETWORK
The bus topology, like the ring, has most frequently
been used for distributing control on local area
networks. Messages placed on the bus are broadcast to
all nodes, which must be able to recognize their own
addresses in order to receive transmissions. However,
unlike nodes in a ring, they do not have to repeat and
forward messages intended for other nodes. As a result,
there is none of the delay and overhead associated with
retransmitting messages at each intervening node, and
nodes are relieved of network control responsibility at
this level.

One distinct advantage of this type of network
architecture is high reliability. Network operation will
continue in the event of node failures, due to the
passive role nodes play in transmission on the bus.
Distributed bus networks are thus inherently resistant
to single point failures. Other advantages include easy
configuration and expansion in most physical layouts
(room, building, or building complex).

Centrally controlled bus topologies are possible, but
are not common, for LANs. They would resemble a
multipoint line with a master and tributary nodes. As in all
cases of centralized control, the master node would
communicate is one of the most attractive features of
the bus topology and local area networks.

There are several considerations for the design of bus
networks. Components, such as transmitters/receivers,
must be designed for reliable and maintainable
operation. Also, owing to the potential difficulty in
locating faults on a bus, network management capability
or test equipment must allow fault detection and
isolation to facilitate repair and maintenance.


2.5 NETWORK REDUNDANCY


Redundancy is defined as "any deliberate duplication or
partial duplication of circuitry or partial duplication
of circuits or information systems on communication
failure."² Redundancy is an effective approach to
dramatically increase the reliability of a system or
network. The provision and maintenance of equipment
represents the cost factor. However, hardware costs
continue to decrease; consequently, this is becoming a
progressively less important factor, particularly in
relation to the losses incurred whenever the system is


² Ibid, p. 1339
inoperable. The redundant equipment also provides added
capability, an attendant reduction in sparing
requirements, and a relaxation of maintenance response time
requirements. All of these factors can be realized with
redundant systems, thereby reducing maintenance costs.

In the final analysis, redundancy may be considered a
viable alternative for meeting reliability requirements.
The following are factors which should be considered and
evaluated:

a.  Reliability  should  be  significantly  increased.  If 
a  unit's  MTBF  is  substantially  greater  than  that  of 
another  non-redundant  system  element,  there  is  no 
advantage  to  duplicating  the  unit. 

b.  The practicality of the unit should be evaluated.
Problems may arise with incorporating a second unit
in the system: additional multiplexing may be
required, or switching hardware may be so unreliable
that little gain is realized.

c.  Redundancy should be cost-effective. The cost of
providing redundancy should be weighed against its
benefits. If no specific reliability requirement
exists beyond a desire that it could be maximized


3.0 RELIABILITY


Reliability (previously discussed in Section 1) may be
mathematically approximated for a device or system as:

$$R(t) = e^{-\lambda t}$$

Here, e denotes the base of the natural logarithm
(2.7183), $\lambda$ is a constant termed the average failure
rate, and t is the time instant for which the device
reliability is desired.

A more convenient form of this equation is:

$$R(t) = \exp(-\lambda t) \qquad (1)$$

Here, exp stands for exponential. The constant for the
average failure rate, $\lambda$, is:

$$\lambda = 1/\mathrm{MTBF} \qquad (2)$$

Thus, Equation 1 states that as MTBF increases, the
probability of failure decreases, and the average
duration of failure-free operation increases.
Sample calculation: What is the probability that a
device will not fail in 500 hours when its MTBF, as
determined from operating experience, is 1,000 hours?

Because the device fails on the average once every 1,000
hours, $\lambda$ is 1/1,000. Using Equation 1 with t = 500,
then:

$$R(t) = \exp(-500/1{,}000) = 0.607$$

Values of the exponential expression can be readily
determined from exponential tables (refer to Table 1),
on a scientific calculator, or (without too much
precision) from the graphs in Figure 12.

The interpretation of the value 0.607 is that there is a
slightly better than 60% chance that the previously
mentioned device with an MTBF of 1,000 hours will run
for 500 consecutive hours without a failure, with the
500 hours starting at any arbitrary instant. Conversely,
there is approximately a 40% probability the device
will experience a failure during an arbitrary time
period of 500 consecutive hours.
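The sample calculation can be reproduced with a few lines of Python in place of exponential tables. This is a minimal sketch of Equations 1 and 2; the function name is hypothetical.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of no failure during t_hours, per Equations 1 and 2."""
    failure_rate = 1.0 / mtbf_hours          # Equation 2
    return math.exp(-failure_rate * t_hours)  # Equation 1


r = reliability(500.0, 1000.0)
print(f"R(500)     = {r:.3f}")        # 0.607
print(f"P(failure) = {1.0 - r:.3f}")  # 0.393
```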




Connection  Types 


In order to effectively judge the reliability of an
end-to-end link in a typical data communications network
(e.g., a remote terminal and a host computer), a link
must first be defined as a series, parallel, or
series/parallel connection (Figure 13).

A typical series link is shown in Figure 14. It
includes every device and line from the remote-job entry
(RJE) terminal to the central processing unit (host
computer). To find the link's reliability, the equivalent
average failure rate of the complete link, $\lambda_l$, must
be computed and inserted into Equation 1.

In a series connection, a failure in any device in the
link will put the entire connection out of action. That
is:

$$R_l = R_1 \times R_2 \times R_3 \times \cdots \times R_n \qquad (3)$$
FIGURE 13 - BASIC LINK TYPES


where $R_l$ is link reliability. Furthermore, the
equivalent average failure rate of the link is the sum of the
failure rates of the individual devices:

$$\lambda_l = \lambda_1 + \lambda_2 + \lambda_3 + \cdots + \lambda_n \qquad (4)$$

Therefore:

$$\mathrm{MTBF}_l = 1 / (\lambda_1 + \lambda_2 + \lambda_3 + \cdots + \lambda_n) \qquad (5)$$

The series link shown in Figure 14 consists of six
devices and elements: the RJE terminal (I), the
terminal modem (G), the line (E), the central site modem (C),
the communications front end (B), and the central
processing unit (A). A failure in any of these elements
affects the RJE-terminal user (I).

Equation 5 yields the mean time between failures as seen
by user I. That is:

$$\mathrm{MTBF}_I = 1 / (\lambda_i + \lambda_g + \lambda_e + \lambda_c + \lambda_b + \lambda_a) \qquad (6)$$

where

$\lambda_i$ = failure rate of the RJE terminal

$\lambda_g$ = failure rate of the RJE terminal's modem

$\lambda_e$ = failure rate of the transmission line from
the RJE terminal

$\lambda_c$ = failure rate of the central-site modem
serving the RJE terminal

$\lambda_b$ = failure rate of the communications front
end processor

$\lambda_a$ = failure rate of the central processing unit

When the hypothetical failure-rate data contained in
Table 1 is inserted, the equivalent MTBF
as seen by the RJE-terminal user works out to:

$$\mathrm{MTBF}_I = 1 / [(1 + 0.2 + 2 + 0.2 + 5 + 10) \times 10^{-3}] = 54.3 \text{ hours}$$

Therefore, the reliability for this link is:

$$R(t) = \exp(-t/\mathrm{MTBF}_I) = \exp(-t/54.3) \qquad (7)$$

The  equations  and  calculations  to  this  point  can  be 
utilized  in  determining  these  important  considerations: 

a)  What  is  the  probability  that  the  RJE-terminal  user 
can  transmit  a  two-hour  job  to  the  host  computer 
without  a  link  failure  during  that  period? 

b)  What  is  the  availability  of  the  remote-job  entry 
terminal  to  the  user? 
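Question (a) can be answered directly from Equations 5 and 7. The sketch below assumes six per-device failure rates of 1, 0.2, 2, 0.2, 5, and 10 per 1,000 hours, which sum to the 18.4 implied by the 54.3-hour result; the pairing of each rate with a particular device is an assumption made for illustration, as are the function names.

```python
import math

# Failure rates per 1,000 hours of operation (device pairing is assumed).
FAILURE_RATES_PER_1000H = {
    "RJE terminal (I)": 1.0,
    "terminal modem (G)": 0.2,
    "line (E)": 2.0,
    "central-site modem (C)": 0.2,
    "communications front end (B)": 5.0,
    "central processing unit (A)": 10.0,
}


def series_mtbf(rates_per_1000h):
    """Equation 5: MTBF of a series link is 1 / (sum of failure rates)."""
    total_rate_per_hour = sum(rates_per_1000h) / 1000.0
    return 1.0 / total_rate_per_hour


def series_reliability(t_hours, rates_per_1000h):
    """Equation 7: probability the whole link survives t_hours."""
    return math.exp(-t_hours / series_mtbf(rates_per_1000h))


rates = list(FAILURE_RATES_PER_1000H.values())
print(f"MTBF_I = {series_mtbf(rates):.1f} hours")           # 54.3 hours
print(f"R(2 h) = {series_reliability(2.0, rates):.3f}")     # 0.964
```

Under these assumed rates, a two-hour job has roughly a 96 percent chance of completing without a link failure. Question (b) additionally requires the mean time to repair for each device.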


TABLE 1

DEVICE OR SUBSYSTEM               FAILURE RATE          MEAN TIME
                                  (PER 1,000 HOURS      TO REPAIR
                                  OF OPERATION)         (HOURS)

REMOTE JOB ENTRY TERMINAL (I)
MODEM AT RJE SITE (G)
LINE TO RJE SITE (E)
MODEM AT CENTRAL SITE (C)
COMMUNICATIONS FRONT END (B)
CPU (HARDWARE/SOFTWARE) (A)
the need for human involvement in the tedious tasks of
network troubleshooting. Included in these diagnostics
will be those for analog line impairments,
bit-error-rate tests including bit-pattern generators,
and protocol tests.
The only real options open to designers who want to
increase  system  availability  are  to  select  reliable 
equipment  in  the  first  place  and  to  install  redundant 
equipment  and  lines  wherever  the  improvement  in  network 
availability  outweighs  the  penalty  of  extra  cost. 
Reliability  is  up  to  the  vendor.  That  is,  only  in  rare 
instances  will  the  user  work  with  the  vendor  to  upgrade 
the  reliability  of  the  equipment.  However,  the  user  can 
look  into  competitive  equipment  and  talk  with  people  who 
have  installed  the  equipment  in  which  he's  interested. 

Since system availability, in terms of usable terminals,
depends on both the reliability, or MTBF, and the
downtime as measured by mean time to repair, or MTTR,
users can overcome the consequences of marginal
reliability by speeding up fault isolation and diagnosis.
This is the reason for the current strong trend toward
installing a full range of diagnostic features as an
integral part of the data communications network. These
diagnostic capabilities have already appeared in
hardware form.

In the future, greater emphasis will be placed on
diagnostic routines driven by software, which will minimize


$$R_5 = \exp(-200/200) = 0.368$$

Therefore, the equivalent reliability is:

$$R_e = (0.819)(0.670)(0.697)(0.697)(0.368) = 0.0981$$

Thus, there is a slightly less than 10 percent chance
that the connection will be sustained without failure in
the first 200 hours of operation.

Here, $\mathrm{MTBF}_1$ = 1,000, $\mathrm{MTBF}_2$ = 500, and $\mathrm{MTBF}_5$ = 200 and,
from Table 2:

$$\mathrm{MTBF}_3 = 3/(2\lambda_3) = 3{,}000/8 = 375$$

and

$$\mathrm{MTBF}_4 = 3/(2\lambda_4) = 3{,}000/8 = 375$$

Therefore, from Equation 5:

$$\mathrm{MTBF}_e = \frac{1}{\frac{1}{1{,}000} + \frac{1}{500} + \frac{1}{375} + \frac{1}{375} + \frac{1}{200}} = 75 \text{ hours}$$
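The series/parallel example at t = 200 hours can be checked numerically. The sketch below assumes unit MTBFs of 1,000 hours (terminal), 500 hours (line), 250 hours for each of the paired controllers and CPUs, and 200 hours (storage system), as implied by the exponents and the 3,000/8 terms in the worked figures. It also assumes the pair formulas 2R - R² for reliability and 3/(2λ) for MTBF; the function names are hypothetical.

```python
import math


def unit_reliability(t, mtbf):
    """Single unit with constant failure rate 1/mtbf."""
    return math.exp(-t / mtbf)


def parallel_pair_reliability(t, mtbf):
    """Two identical units in parallel: the pair fails only if both fail."""
    r = unit_reliability(t, mtbf)
    return 2.0 * r - r * r


def parallel_pair_mtbf(mtbf):
    """Equivalent MTBF of two identical parallel units: 3 / (2 * lambda)."""
    return 1.5 * mtbf


T = 200.0
stage_reliabilities = [
    unit_reliability(T, 1000.0),          # terminal
    unit_reliability(T, 500.0),           # line
    parallel_pair_reliability(T, 250.0),  # parallel controllers
    parallel_pair_reliability(T, 250.0),  # parallel CPUs
    unit_reliability(T, 200.0),           # storage system
]
link_reliability = math.prod(stage_reliabilities)

stage_mtbfs = [1000.0, 500.0, parallel_pair_mtbf(250.0),
               parallel_pair_mtbf(250.0), 200.0]
link_mtbf = 1.0 / sum(1.0 / m for m in stage_mtbfs)

print(f"R_e(200) = {link_reliability:.3f}")
print(f"MTBF_e   = {link_mtbf:.1f} hours")
```

Carrying full precision gives a link reliability of about 0.098; the hand calculation reaches 0.0981 because each factor is rounded to three places before multiplying.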
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combination.  The  equations  for  series  connections  can 
be  used  to  determine  the  reliability  and  equivalent  MTBF 
for  a  series/parallel  connection  (Figure  13C) . 

Using the hypothetical data contained in Table 3 for the
configuration in Figure 15, compute the equivalent MTBF
and reliability (Re) as seen by the user of the terminal,
T, for an interval of t = 200 hours. Here, treating
the configuration as if it were a series connection only
and using Equation 3:

Re = R1 x R2 x R3 x R4 x R5

where R1 is the terminal reliability, R2 is the line
reliability, R3 the reliability of the parallel controllers,
R4 the reliability of the parallel CPUs, and R5 the
storage-system reliability. Thus, at t = 200:

R1 = exp(-200/1,000) = 0.819

R2 = exp(-200/500) = 0.670

R3 = 2exp(-200/250) - exp(-400/250) = 0.697

R4 = 2exp(-200/250) - exp(-400/250) = 0.697


Of course, the key practical problem for data communications
users is their lack of ability to control these
MTBF variables for individual devices.

Table 2 contains the equations for the equivalent
reliability of several types of parallel configuration.
Here, the equivalent subsystem can be treated mathematically
as being one element in a series connection, even
though the actual equipment is linked in parallel. This
table also contains the MTBF of the equivalent subsystem.
Thus, for the preceding case, the appropriate
equation is 3/(2λ) (both communications front ends have
the same 500 hour MTBF). Therefore, the net MTBF is:

3 x 500/2 = 750 hours

That is, even if one communications front end fails at
the end of 500 hours and is replaced by the hot standby
unit, statistically the combination will last another
250 hours, during which the network remains operational
while the failed front end is being repaired.
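The 3/(2λ) rule can be cross-checked numerically, since the MTBF of any configuration is the area under its reliability curve. The sketch below assumes the two-unit parallel reliability R(t) = 2exp(-λt) - exp(-2λt) and integrates it with a simple trapezoidal rule; the integration horizon and step count are arbitrary choices for illustration.

```python
import math


def parallel_pair_reliability(t_hours, unit_mtbf_hours):
    """R(t) for two identical units in parallel, each with constant failure rate."""
    r = math.exp(-t_hours / unit_mtbf_hours)
    return 2 * r - r * r


def mtbf_by_integration(unit_mtbf_hours, horizon_multiple=40, steps=100_000):
    """Trapezoidal integral of R(t) from 0 to horizon_multiple * unit MTBF."""
    upper = horizon_multiple * unit_mtbf_hours
    h = upper / steps
    total = 0.5 * (parallel_pair_reliability(0.0, unit_mtbf_hours)
                   + parallel_pair_reliability(upper, unit_mtbf_hours))
    for k in range(1, steps):
        total += parallel_pair_reliability(k * h, unit_mtbf_hours)
    return total * h


numeric = mtbf_by_integration(500)
closed_form = 3 * 500 / 2
print(f"numeric = {numeric:.2f} hours, 3/(2*lambda) = {closed_form:.2f} hours")
```

Both routes give about 750 hours for two 500-hour front ends, that is, 250 hours beyond the MTBF of a single unit.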

Series/Parallel Connections

In practice, if redundant equipment is used at all, the
actual configuration will be a series/parallel

For t = 500 hours, then:


R = 2 exp(-1) - exp(-2) = 0.601

It is important to note that since the function R(t) is
not purely an exponential, the comparison of reliability
is only valid for the first 500 hours, not an arbitrary
500 hours.

By comparison, one CPU having an MTBF of 500 hours
yields a reliability of:

R = exp(-500/500) = 0.368

while one CPU having an MTBF of 1,000 hours has a
reliability of:

R = exp(-500/1,000) = exp(-0.5) = 0.606
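The three reliabilities compared here can be reproduced in a few lines. This is a sketch under the same assumptions as the text (constant failure rates, and a redundant pair that fails only when both units have failed); the function names are illustrative.

```python
import math


def single_cpu_reliability(t_hours, mtbf_hours):
    """R(t) = exp(-t/MTBF) for one CPU."""
    return math.exp(-t_hours / mtbf_hours)


def redundant_pair_reliability(t_hours, mtbf_hours):
    """R(t) = 1 - (1 - P)^2: the pair fails only if both CPUs fail."""
    p = single_cpu_reliability(t_hours, mtbf_hours)
    return 1.0 - (1.0 - p) ** 2


t = 500
pair_500 = redundant_pair_reliability(t, 500)
single_500 = single_cpu_reliability(t, 500)
single_1000 = single_cpu_reliability(t, 1000)

print(f"redundant pair, MTBF 500 each: {pair_500:.3f}")
print(f"single CPU, MTBF 500:          {single_500:.3f}")
print(f"single CPU, MTBF 1,000:        {single_1000:.3f}")
```

At full precision the redundant pair comes out near 0.600; the 0.601 quoted in the text follows from rounding exp(-1) to 0.368 before doubling.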

For the situations discussed here, the redundant-CPU
connection definitely improves reliability, but at
substantially double the cost. However, taking steps to
improve the MTBF of a single CPU from 500 hours to 1,000
hours will probably be less costly and yet will provide
essentially the same reliability (0.606 versus 0.601) as
a redundant configuration.

Suppose a computer-based communications system uses
redundant central processing units, each with an MTBF of
500 hours.

What is the probability that the parallel CPU combination
will operate (not fail) for 500 hours?

How does this performance compare with the reliability
when using just one CPU? What is the net mean time
between failures of the parallel combination?

Using Equation 1:

P1 = exp(-t/500) and P2 = exp(-t/500)

Therefore:

1 - P1 = 1 - [exp(-t/500)] and 1 - P2 = 1 - [exp(-t/500)]

Hence, using Equation 8:

R = 1 - [1 - exp(-t/500)][1 - exp(-t/500)]

R = 2 exp(-t/500) - exp(-2t/500)


Here,  the  reliability  of  the  entire  series  connection  is 
obtained  simply  by  summing  the  individual  failure  rates 
of  each  device. 
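Summing the failure rates and applying a single exponential can be sketched as below. The six rates are those read from the worked RJE-link example in this section (Table 1 itself is not reproduced here), so they should be treated as assumptions; in particular the third rate is taken as 2 so that the total matches the 18.4 x 10^-3 per hour quoted in the text.

```python
import math

# Failure rates in failures per 1,000 hours, as read from the worked example
# (assumed values; they sum to 18.4).
FAILURE_RATES_PER_1000H = [1.0, 0.2, 2.0, 0.2, 5.0, 10.0]


def series_failure_rate(rates_per_1000h):
    """Total failure rate of a series link, in failures per hour."""
    return sum(rates_per_1000h) / 1000.0


def series_reliability(t_hours, rates_per_1000h):
    """R(t) = exp[-(lambda_1 + ... + lambda_n) t] for a series connection."""
    return math.exp(-series_failure_rate(rates_per_1000h) * t_hours)


link_mtbf = 1.0 / series_failure_rate(FAILURE_RATES_PER_1000H)
two_hour_job = series_reliability(2, FAILURE_RATES_PER_1000H)

print(f"link MTBF = {link_mtbf:.1f} hours")
print(f"R(2 hours) = {two_hour_job:.3f}")
```

The same function answers the job-completion question for any run length by changing the value of t.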


The use of redundant or hot standby devices and lines is
particularly common in computer-based data communications
systems. Two devices, perhaps two communications
front ends, are placed in parallel with each other, but
only one device has to be on line for the network to be
operational. If the operating unit fails, then the
standby unit is promptly placed on line. A diagram of a
generalized parallel connection is contained in Figure

which equals 3.1 hours.

Refer to Table 2 and other standard reliability
handbooks, which define system availability, a, as:

a = (MTBF) / (MTBF + MTTR)

so that, for the situation in the series-connection
numerical example:

a = (54.3) / (54.3 + 3.1) = 0.946

Consequently, the user of the RJE terminal can expect
that, on average, the terminal will be available for
communications with the host computer 946 out of every
1,000 operating hours, and that once the operator starts
a two-hour job the run will continue to completion in
about 96% of the attempts.
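The availability figure can be reproduced directly from the definition. This is a minimal sketch using the MTBF and average MTTR of the series-connection example (54.3 and 3.1 hours); the function name is illustrative.

```python
def availability(mtbf_hours, mttr_hours):
    """a = MTBF / (MTBF + MTTR): long-run fraction of time the link is usable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


a = availability(54.3, 3.1)
available_hours_per_1000 = 1000 * a

print(f"a = {a:.3f}")
print(f"available hours per 1,000 operating hours = {available_hours_per_1000:.0f}")
```

Because MTTR appears only in the denominator, halving the repair time raises availability without any change to the equipment's MTBF.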

An alternative and perhaps more direct way of calculating
the reliability of a series connection is to use an
equivalent relationship derived from Equations 1, 2, 3,
and 4A, namely:

R(t) = exp[-(λ1 + λ2 + λ3 + ... + λn)t]   (4D)

The average value of the MTTR is the sum of the individual
devices' MTTRs, with each MTTR multiplied by its
own failure probability. That is:


AVG MTTR = Σ (MTTRi)(Pi), summed over i = 1 to n

The sum of the probabilities of failure of the devices
in the link must add up to unity. To find the individual
failure probabilities requires the mathematical step
called normalization.

That is:

Pi = λi / (λ1 + λ2 + ... + λn)

Therefore:

AVG MTTR = (λ1 MTTR1 + λ2 MTTR2 + ... + λn MTTRn) / (λ1 + λ2 + ... + λn)

Using the failure-rate and MTTR data contained in Table
1 for the link shown in Figure 14, the average MTTR as
seen by the user of the RJE link is:

[(1)(3) + (0.2)(2.5) + (2)(4) + (0.2)(2.5) + (5)(5) + (10)(2)] x 10^-3 / (18.4 x 10^-3)
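The normalization step and the weighted average can be sketched as follows. The (failure rate, MTTR) pairs are those read from the numerical example for the RJE link, with rates in failures per 1,000 hours; since Table 1 is not reproduced in this section, they are assumptions, and the names are illustrative.

```python
# (failure rate per 1,000 hours, MTTR in hours) for each device in the link,
# as read from the worked example (assumed values).
DEVICES = [(1.0, 3.0), (0.2, 2.5), (2.0, 4.0), (0.2, 2.5), (5.0, 5.0), (10.0, 2.0)]


def failure_shares(devices):
    """Normalize the failure rates so the per-device failure probabilities sum to 1."""
    total_rate = sum(rate for rate, _ in devices)
    return [rate / total_rate for rate, _ in devices]


def average_mttr(devices):
    """Sum of each device's MTTR weighted by its share of the link's failures."""
    shares = failure_shares(devices)
    return sum(p * mttr for p, (_, mttr) in zip(shares, devices))


shares = failure_shares(DEVICES)
avg_mttr = average_mttr(DEVICES)
print(f"AVG MTTR = {avg_mttr:.1f} hours")
```

The device with the highest failure rate dominates the average: here the last device accounts for more than half of all failures, so its 2-hour repair time pulls the link average down toward 3.1 hours.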


The answer to the first question can be obtained from
Equation 7, using t = 2 hours:

R(t) = exp(-2/54.3) = 0.964

This  interpretation  says  that  96  out  of  every  100 
two-hour  job  attempts  will  be  processed  without  link 
failure.  However,  four  times  out  of  a  hundred,  the  job 
will  be  aborted  by  an  individual  failure. 


Mean  Time  To  Repair 


The answer to the question on terminal availability
requires the introduction of the concept of mean time to
repair (MTTR) or, more specifically, an average MTTR
embracing all the devices in the link. When a device
fails, some time will elapse before it can be repaired
and restored to service. The longer the MTTR, the lower
the availability of the terminal to the user. The MTTR
is obtained from operating experience, and each device
in a series link will have its own MTTR value. Average
MTTR, then, is one value for the link that takes into
account all the individual devices' MTTRs.

CASE STUDY


ACTUAL  SYSTEMS 


The individual systems to be studied are a dual loop
with redundant front end modems and a radial system
with 8 points of connection, shown in Figures 16 and 17.

The  ring  type  system  as  described  is  actually  a  bus 
configuration  arranged  in  a  loop,  since  each  node  has 
access  to  two  lines.  The  communication  lines  are  common 
to  all. 


The radial system is actually a parallel system, since
each node has its own front end modem. Each may be
re-configured as shown in Figures 18 and 19.

For the radial or parallel system, each link would have
the following points (Refer to Figure 20). The failure
rate for the link would be the sum of the failure rates
for each individual device:

(λ1 + λ2 + λ3 + λ4 + λ5)

The MTBF would be: 1 / (λ1 + λ2 + λ3 + λ4 + λ5)

Substituting the failure rates (failures per 1,000 hours
of operation) from Table 1:

MTBF = 1 / [(0.2 + 2.0 + 5.0 + 2.0 + 10.0) x 10^-3]
     = (1/19.2) x 10^3 = 0.0521 x 10^3 = 52.1 hours

The reliability for this link would be:

R(t) = exp(-t/52.1) for any time period t;

R(200) = exp(-200/52.1) = 0.0215 = 2.15% in 200 hours.

The average MTTR for this configuration would be:

[(0.2)(3.0) + (2.0)(4.0) + (5.0)(5.0) + (2.0)(4.0) + (10.0)(2.0)] / (0.2 + 2.0 + 5.0 + 2.0 + 10.0)
= 61.6/19.2 = 3.21 hours

The availability would be:

MTBF / (MTBF + MTTR) = 52.1 / (52.1 + 3.21) = 0.942

Since each link is connected to the CPU through a modem,
each link would have an availability of 94.2%, with no
factor for the other links in the system. The failure of
any other link would not affect
the operation of the rest of the system.
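The radial-link figures can be gathered in one pass. This sketch assumes the five (failure rate, MTTR) pairs quoted in the case study, with rates in failures per 1,000 hours, and uses the failure-rate-weighted average MTTR defined earlier, whose denominator is the sum of the failure rates; the function and key names are hypothetical.

```python
import math

# (failure rate per 1,000 hours, MTTR in hours) for the five devices of one link
RADIAL_LINK = [(0.2, 3.0), (2.0, 4.0), (5.0, 5.0), (2.0, 4.0), (10.0, 2.0)]


def link_summary(devices, t_hours):
    """MTBF, R(t), average MTTR, and availability for devices in series."""
    rate_sum = sum(rate for rate, _ in devices)       # per 1,000 hours
    failures_per_hour = rate_sum / 1000.0
    mtbf = 1.0 / failures_per_hour
    reliability = math.exp(-failures_per_hour * t_hours)
    avg_mttr = sum(rate * mttr for rate, mttr in devices) / rate_sum
    return {
        "mtbf_hours": mtbf,
        "reliability": reliability,
        "avg_mttr_hours": avg_mttr,
        "availability": mtbf / (mtbf + avg_mttr),
    }


summary = link_summary(RADIAL_LINK, 200)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

With these inputs the link MTBF is about 52.1 hours, R(200) about 2.15%, the average MTTR about 3.2 hours, and the availability about 0.94.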

For the bus system, each line would have the following
points (Refer to Figure 21). All nodes are in parallel.
The failure rate for each remote unit would be the same
as that for a series connection, or a radial system as
previously outlined. However, since all eight are in
parallel on the same bus, a bus failure would cause a
system failure. The addition of the second bus in a
redundant configuration would give an increased
reliability:

MTBF = 3/(2λ) = (3/2)(52.1) = 78.1 hours

R(t) = exp(-t/MTBF)

R(200) = exp(-200/78.1) = 0.07724 = 7.7% in 200 hours


Summary  Results  are  as  follows: 


Radial  System:  MTBF  =  52.1  Hrs;  R(200)  =  2.15% 


Ring System: MTBF = 78.1 Hrs; R(200) = 7.7%
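The two summary lines can be reproduced side by side. The sketch follows the case study's own simplifications, which are assumptions rather than exact results: the redundant loop is given the two-unit parallel MTBF of 3/(2λ), and its reliability is approximated by a single exponential exp(-t/MTBF). The variable names are illustrative.

```python
import math


def exponential_reliability(t_hours, mtbf_hours):
    """R(t) = exp(-t/MTBF), the single-exponential form used in the case study."""
    return math.exp(-t_hours / mtbf_hours)


radial_mtbf = 1000.0 / 19.2    # one radial link: summed rate of 19.2 per 1,000 hours
ring_mtbf = 1.5 * radial_mtbf  # redundant loop, using the 3/(2*lambda) rule

radial_r200 = exponential_reliability(200, radial_mtbf)
ring_r200 = exponential_reliability(200, ring_mtbf)

print(f"Radial: MTBF = {radial_mtbf:.1f} h, R(200) = {radial_r200:.2%}")
print(f"Ring:   MTBF = {ring_mtbf:.1f} h, R(200) = {ring_r200:.2%}")
```

The small differences from the printed summary come from rounding the MTBF to 52.1 and 78.1 hours before taking the exponential.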


CONCLUSIONS 


While the numbers indicate that the redundant loop is
the more reliable of the two, some practical considerations
must be weighed.

A.  Both of the loops would probably be installed in
the same bundle. A mechanical or lightning-caused
problem would affect both loops.

B.  Damage due to mechanical problems almost always
causes a short in the cable, thus rendering the
entire loop inoperative. An advantage of the loop is
that the faulted section can be isolated, allowing the
loop to feed in both directions. However, the entire
system would be down until the fault was isolated.

C.  A  severe  lightning  strike  on  one  loop  would  more 
than  likely  destroy  a  large  portion  of  the 
electronic  connection. 

D.  The radial system, if all lines are kept separate,
would practically eliminate a total system failure.
The disadvantage is that the one link that has been
damaged would remain out of commission until
repaired, with no way to bypass the fault.